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This is a follow-up sensitivity study on r-mode gravitational wave signals from newborn neutron stars 
illustrating the applicability of machine learning algorithms for the detection of long-lived gravitational-wave 
transients. In this sensitivity study we examine three machine learning algorithms (MLAs): artificial neural 
networks (ANNs), support vector machines (SVMs) and constrained subspace classifiers (CSCs). The objective 
of this study is to compare the detection efficiency that MLAs can achieve with the efficiency of conventional 
detection algorithms discussed in an earlier paper. Comparisons are made using 2 distinct r-mode waveforms. 
For the training of the MLAs we assumed that some information about the distance to the source is given so that 
the training was performed over distance ranges not wider than half an order of magnitude. The results of this 
study suggest that machine learning algorithms are suitable for the detection of long-lived gravitational-wave 
transients and that when assuming knowledge of the distance to the source, MLAs are at least as efficient as 
conventional methods. 


I. INTRODUCTION 


It was demonstrated (1998) that neutron star r-mode oscilla¬ 
tions of any harmonic, frequency and amplitude belong to the 
group of non-axisymmetric modes that can be driven unstable 
due to gravitational radiation Emin- This theory suggested 
that (assuming the r-mode oscillation amplitude grows suffi¬ 
ciently large) r-mode gravitational radiation (primarily in the 
m = 2 harmonic) could carry away most of the angular mo¬ 
mentum of a rapidly rotating newborn neutron star in the form 
of r-mode gravitational radiation. 

In a previous paper J9] we presented a sensitivity study 
of a seedless clustering detection algorithm based on r-mode 
waveforms predicted by the Owen et. al. 1998 model 
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and the gravitational radiation power given by 

Ek 3.5 x 10 19 / 8 H 2 W. (3) 


This model depends on two parameters: the (dimensionless) 
saturation amplitude, a, of the r-mode oscillations and the ini¬ 
tial gravitational wave spindown frequency f a . The theoreti¬ 
cal predictions for the values of these parameters were dis¬ 
cussed extensively in our previous paper. The values we con¬ 
sidered for a lie in the range of 10 3 — 10 _1 while the values 
we considered for f 0 lie in the interval of 600 — 1600 Hz. Due 
to the wide range within which the values of these parameters 
lie, we cannot effectively use a matched filtering algorithm. 
Instead, we have to develop techniques that could detect all 
possible distinct waveforms. 


In our previous paper, a seedless clustering algorithm was 
used. This algorithm is based on the statistical significance of 
signal to noise ratios (snr) of clusters of above the threshold 
snr pixels. This method is not dependent on any knowledge 
of the signal and it can be applied genetically to any long- 
lived gravitational-wave transients. In particular, it is unable 
to discriminate between r-modes and other possible gravita¬ 
tional wave sources. Knowledge of the r-mode signal can be 
used to make minor modifications in the clustering algorithm, 
however, there was not much hope for a dramatic improve¬ 
ment in the efficiency. Nevertheless, we were able to recover 
signals of magnitude 5 times weaker than the noise. 

In the sensitivity study performed for the clustering algo¬ 
rithm, we used 9 distinct waveforms. These were chosen by 
taking ( a , f Q ) pairs using 3 values (10 _1 , 10~ 2 , 10~ 3 ) for 
a and 3 values (700 Hz, 1100 Hz, 1500 Hz) for f a . In this 
sensitivity study for the MLAs, for comparison purposes, we 
used 2 of these waveforms: (/ D = 1500 Hz, a = 0.1) and 
( f Q = 1100 Hz, a = 0.01). These waveforms as well as their 
corresponding power decays are shown in Fig[I]and Fig(2]re- 
spectively. MLAs are well suited especially for cases where 
the signal is not precisely (but only crudely) known. This pa¬ 
per is based on three specific MLAs: ANN 0, SVM a and 
CSC f8j]. All three methods are considered novel applications 
in the area of long transient gravitational wave searches. 

This paper is designed as follows: The next section de¬ 
scribes the preparation of the data used for the training of the 
MLAs. The discussion extends to the resolution reduction of 
the data maps and also to the training efficiencies as a function 
of resolution reduction for all three MLAs. In section 3 we 
present a brief introduction into ANN methods, in section 4 
we do the same for SVM methods and in section 5 we present 
a similar introduction for CSCs. In section 6 we present and 
discuss our results and compare the MLA efficiencies to the 
conventional (clustering) algorithms efficiencies. Finally, in 
section 7, we summarize our conclusions including topics for 
future work. 
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Waveforms with f Q — 1500Hz, a — 10 1 and f 0 — 1100Hz, a — 0.01 
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FIG. 1. The (/ 0 = 1500 Hz, a = 0.1) waveform is the most power¬ 
ful waveform considered in our sensitivity studies both for the clus¬ 
tering and the MLAs. The second waveform we chose has an ampli¬ 
tude 25 times smaller than the first one. This waveform has param¬ 
eters (/ 0 = 1100 Hz, a = 0.01) and is approximately monochro¬ 
matic for the durations our sensitivity studies were designed for. The 
clustering algorithm could detect the weaker signal at distances not 
further than a few Kpcs. 
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FIG. 2. These power evolutions correspond to the signals in Fig|T] 
We see that the blue plot corresponds to a rapidly decaying signal: 
within 2500 s the radiation power drops to 17% of its original value. 
The red plot corresponds to a power decay that dropped to only 
99.9% of the initial. The power of this (red) signal is about 3 orders 
of magnitude lower than that of the signal plotted in blue. We chose 
this weak signal so that we can examine how the MLAs compare to 
the clustering algorithm both for powerful and weak signals. 


II. DATA PREPARATION 


A. Choice of the f 0 and a parameter values 


From equations 0 and Q we have the two model parame¬ 
ters a and f a that determine the shape of the waveform. Apart 
from the shape, the injections that were produced for the train¬ 
ing of the MLAs were also dependent on the pixel brightness 
or pixel signal-to-noise ratio (SNR). For a single pixel in the 
frequency-time maps (ft-maps) the SNR satisfies nm 


SNR(f, /, 0) ex Re 




(4) 


where i = 1 and j = 2 are indices corresponding to the two 
advanced LIGO (aLIGO) detectors (T|, Q,j(t. f. Cl) is a fil¬ 
ter function that depends on the source direction, Cl, [2j and 
Cij = 2h*(t, f)hj(t, f) is the cross spectral density, h being 
the Fourier transform of the gravitational wave strain ampli¬ 
tude h. The latter is expressed in |[6j as a function of the dis¬ 
tance d to the source, the gravitational-wave frequency / and 
the r-mode oscillation amplitude a, by 


h 


1.5 x 1CT 23 




(5) 


For the construction of the injection maps we chose the 3 pa¬ 
rameter values f 0 , a and h 2 to be uniformly distributed within 
the corresponding predetermined ranges. The values of the 
distance d are constrained accordingly such that the above 
conditions are satisfied. 

Each injection set that was produced and used for the MLA 
training was limited to 11350 injection maps and 11350 noise 
maps. This was due to the finite amount of data that was pro¬ 
duced during the S5 LIGO run, the computational resources 
available as well as the time needed to produce the 22700 
maps. For each injection the waveform was randomly cho¬ 
sen in such a way that the a value was randomly chosen 
from a uniform distribution of 11350 a values in the range 
of 10 -3 — 10 _1 , the f a value was randomly chosen from 
a uniform distribution of 11350 f 0 values in the range of 
600 — 1600 Hz, and for the h 2 values we picked 3 ranges (for 
3 separate MLA trainings), whose choice is discussed in the 
next paragraph. 


B. Choice of the h 2 parameter values 

The results of the sensitivity study for the clustering algo¬ 
rithm showed that for a signal of / = 1500 Hz and a = 0.1 
the detection distance was up to 1.2 Mpc. Using equation 
0 we see that the conventional algorithms can detect grav¬ 
itational wave strains of value h ss 4 x 10 -24 . Values of 
the same order are obtained if we substitute the results for 
the other 8 waveforms. For example from table 1 in J9) we 
see that for / = 700 Hz and a = 0.01 we get a detection 
distance of 0.043 Mpc. Substituting in equation (|5]i we get 
h = 1.2 x 10 -24 . Therefore, the value of h « 10 -24 will 
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become a reference point because this is the value of gravi¬ 
tational wave strain the MLAs will have to detect if they are 
shown to be at least as efficient as the conventional clustering 
algorithms fl2l . 

If we consider supernova events at distances in the range 
from 1 Kpc to 1 Mpc then the corresponding range for the 
gravitational wave strain values is h « 10 -24 to 10 -21 . 
Therefore, there are several approaches in determining the 
range of h 2 values for the injection maps produced for the 
training of the MLAs: 

(i) produce one set of data with injections at distances dis¬ 
tributed in such a way that the h 2 values are uniformly dis¬ 
tributed in the range from 10~ 48 to 10 -42 (i.e. 10” 24 < h < 
1CT 21 ). 

(ii) produce three sets of data such that the h 2 values are 
uniformly distributed in the following ranges: 

(a) from 10" 46 ' 4 to 10" 45 - 4 (i.e. 1CT 23 ' 2 < h < 10" 22 ' 7 ) 

(b) from 10" 47 ' 4 to 10" 46 ' 4 (i.e. KT 23 ' 7 < h < 1CT 23 ' 2 ) 

(c) from lO -48 0 to 10" 47 - 4 (i.e. lO” 240 < h < 10" 23 7 ) 

The last choice of 10 -24 is such that the waveform with 

(fo = 1500 Hz, a = 0.1) may be detectable up to a distance 
of 5 Mpc, depending on the MLA detection efficiencies. Note 
that at those distances (in the neighborhood of Milky Way) the 
supernova event rate is 1 every 1-2 years. 


C. Production of the data matrix for the MLA training 


We start with the data maps in the frequency-time domain 
(ft-maps) produced by the stochastic transient analysis multi¬ 
detector pipeline (STAMP) lUTIl : 11350 noise maps and 11350 
(r-mode) injection maps. These ft-maps are produced us¬ 
ing time-shifted S5 data recolored with the aLIGO sensitiv¬ 
ity noise curve. Each map has a size of 1001 x 4999 pix¬ 
els with each pixel along the vertical axis corresponding to 
1 Hz and each pixel along the horizontal axis correspond¬ 
ing to 0.5 s, hence the length of the map is 2499.5 s. This 
ft-map is reshaped to a 1 x 5003999 row vector that oc¬ 
cupies a disc space of 37.4MB. We reshaped all 22700 ft- 
maps (each one of size 1001 x 4999) and produced 22700 
row vectors x, t with i £ {1, 2,..., 22700}. The rows with 
i £ {1, 2,..., 11350} correspond to the noise ft-maps while 
the rows with i £ (11351,11352,..., 22700} correspond to 
the injection ft-maps. 

We then used the rows x\ with i £ (1, 2,..., 11350} to pro¬ 
duce a 11350 x 5003999 noise data matrix, X\, given by 


f Xl \ 

*2 


X i = 


V *11350 / 


( 6 ) 


and we also used the rows with Xi with i £ 
(11351,11352,...,22700} to produce a 11350 x 5003999 


injection data matrix, X 2 , given by 


*2 


/ *11351 \ 
*11352 


(7) 


V *22750 / 


The MLAs would take as an input the 22700 x 5003999 data 
matrix given by 



Each row Xi with i £ (1,2,.., 22700} of the data matrix X 
corresponds to a single ft-map. The total number of rows is 
equal to the number of data points, n = 22700, while the total 
number of columns (i.e. the total number of features) is equal 
to D = 5003999, where D is the dimensionality of the feature 
space in which each single ft-maps lives. 

For any matrix we know that row rank = column rank, 
therefore, the number of linearly independent columns of X 
is equal to 22700. This number is determined by the lim¬ 
ited number (n = 22700) of ft-maps we could produce. 
This means that even though each single ft-map lives in a 
5003999—dimensional space, we can only approximate these 
ft-maps as vectors living in a 22700-dimensional space (sub¬ 
space of the 5003999—dimensional space). The best ap¬ 
proximation of this subspace would be the one in which the 
most ’dominant’ 22700 features (out of the total number of 
5003999) constitute a basis of the subspace. A well known 
method of choosing the 22700 most dominant features is de¬ 
scribed by the principal component analysis (PCA) 03 or see 
section [VB} However, the data matrix X is of size 848.8 GB 
and this makes the (RAM) memory required to perform PCA 
on X beyond 1TB, making it practically impossible to per¬ 
form PCA on X with realistically available computing re¬ 
sources. 

A reliable approach to solve the problem of the high di¬ 
mensionality of the features (D n) is to seek MLAs that 
will naturally select d-many features (with d<i)) such that 
d < n DU. Three classes of MLAs that can achieve this 
are the ANN, SVM and CSC methods. However, the data 
matrix is too large to attempt to perform any MLAs on it. 
Therefore, the only way out of these restrictions the data ma¬ 
trix size imposes, is to perform resolution reduction for each 
1001 x 4999 ft-map (before reshaping each one of them to a 
row vector). After the resolution reduction, performing fur¬ 
ther feature selection would still benefit the training of the al¬ 
gorithms in terms of speed. The right choice of features can 
significantly decrease the training time without noticeably af¬ 
fecting the training efficiencies. 

A resolution reduction on the ft-maps would result in 22700 
1 x d row vectors such that d <C I). The desired effect of the 
resolution reduction would be to get d < n. The first guess for 
such a reduction would be to choose a factor of D/n ~ 220. 
That would be equivalent to a reduction by a factor of ~ 15 
along each axis (frequency and time) of the ft-map. However, 
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it turned out that this is not the optimal resolution (per axis) 
reduction factor. The following two sub-sections describe the 
experimentation on the reduction factor. 


D. Resolution reduction: bicubic interpolation 

To perform the resolution reduction, we used the imresize 
matlab function. The original ft-map of 1001 x 4999 pix¬ 
els consists of a 1002 x 5000 point grid. Imresize will first 
decrease the number of points in the point grid according to 
the chosen resolution reduction factor, r. When r = 10 -2 , 
the imresize function gives a new ft-map of dimensionality 
11 x 50, the new point grid will be 12 x 51. Interpolation is 
then used to calculate the surface within each pixel in the new 
point grid. 

We used the bicubic interpolation option of the imresize 
function. According to this, the surface within each pixel can 
be expressed by 

3 3 

= (9 > 

i—0 j—0 

The bicubic interpolation problem is to calculate the 16 ay- 
coefficients. The 16 equations used for these calculations 
consist of the following conditions at the 4 corners of each 
pixel: 

(a) the values of S(t, f) 

(b) the derivatives of S(t, f) with respect to t 

(c) the derivatives of S(t, f) with respect to / and 

(d) the cross derivatives of S(t, f) with respect to t and / 

Determining the resolution reduction factor that would 
yield the best training efficiencies for the MLAs was not a 
very straight forward task. To do so we performed a series 
of tests using the set of 11350 noise ft-maps and the set of 
11350 injection ft-maps. The injected signal SNR values lay 
in a range such that 10 -23 7 < h < 10~~ 23 ’ 2 . 


E. Resolution reduction versus training efficiency 

We tested 5 different resolution reduction factors (r = 
lO -1 , r = HT 1 ' 5 , r = 10" 2 , r = 10" 2 ' 5 and r = 10" 3 ) 
where the value of r corresponds to the factor by which each 
axis resolution is reduced. The resulting data matrices had 
dimensions 22700 x 50500, 22700 x 5155, 22700 x 550, 
22700 x 64 and 22700 x 10 respectively. Subsequently each of 
the three MLAs were trained and the training efficiencies were 
plotted against the resolution reduction factors. The results are 
shown in Fig[3] From the plots we see that the training effi¬ 
ciencies first improve as we lower the resolution. For too low 
or too high resolution reductions the training efficiencies de¬ 
crease. This behavior was consistent on all three MLAs. At a 
reduction factor of 100 per axis we have the maximum train¬ 
ing efficiency. Resolution reduction offers two advantages: (a) 
it increases the MLA training efficiency and (b) it reduces the 
training time. Using the results from Fig|3]we determined that 


Deter mining the ft-map resolution reduction for the MLA trai ning 



FIG. 3. Training efficiencies of ANN (blue), SVM (green) and CSC 
(red) versus the resolution reduction (per axis). For SVM and CSC 
there is a clear peak at a resolution reduction factor of 10~ 2 . The 
ANN peak seems to be a little off but for uniformity we used 1CL 2 
for all 3 MLAs. The training of all three MLAs was performed using 
the (a = 0.1, f 0 = 1500) waveform. No tests have been performed 
to verify the validity of these plots for other waveforms or other h 
value ranges. 


the best resolution reduction would be the factor of r = 10 -2 . 
This results in a data matrix with dimensions of 22700 x 550 
(disc space of 84MB). 

After dimensionality reduction, the matrices X\ and X- 2 as 
given by equations <0 and 0 respectively become X\ and 
X' 2 each one of a reduced dimensionality 11350 x 550. We 
define these matrices as follows 


( Xl \ 

X2 


X 


/ _ 

1 - 


V £11350 / 


( 10 ) 


with row vectors Xi £ R 550 where i £ { 1,2,..., 11350} and 


*2 


/ £11351 \ 
£11352 


(ID 


V £22750 / 


with row vectors x,j. £ R 550 where i £ {11351,..., 22700}. 
Similarly we define the dimensionally reduced 22700 x 550 
data matrix 

*'=(}]). ( 12 , 

The number of rows, n = 22700, is the number of data points 
(ft-maps) and the number of columns, d = 550, is the number 
of features of each point or the dimensionality of the space in 
which each ft-map lives (after the resolution reduction). 
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III. ARTIFICIAL NEURAL NETWORKS 

The 22700 x 550 data matrix, X', will be presented as in¬ 
put into a feed-forward neural network with an input layer of 
dimensionality equal to the number of columns of the data ma¬ 
trix (i.e. d = 550). For the training of the ANN we randomly 
picked 90% of the first 11350 (injection data) rows and also 
90% of the second 11350 (noise data) rows. The other 10% of 
the (injection and noise) rows was used to determine the train¬ 
ing efficiency of the trained algorithm. The ANN had one hid¬ 
den layer with 50 nodes (‘neurons’) and an output layer with a 
two ‘neurons’ that would ‘fire’ for ‘signal’ or ‘no signal’, re¬ 
spectively. The ‘hidden’ layer used ‘neurons’ with the logistic 
sigmoid function m 


a(aj) 


1 

1 + exp(— a,j) 


(13) 


where a,j ( j = 1,2,..., 550) are the values presented at one 
‘neuron’ in the hidden layer. The purpose of the hidden layer 
is to allow for non-linear combinations of the input values 
to be forwarded to the output layer. These combinations in 
the hidden layer carry forward ‘features’ from the input to the 
output layer that would not be possible to be extracted from 
each individual neuron in the input layer, enabling non-linear 
classification. The number of hidden layers and hidden neu¬ 
rons was chosen, as is typically done, after experimentation 
with various ANN architectures, aiming to enhance the accu¬ 
racy, the robustness and the generalization ability of the ANN, 
along with the training efficiency and feasibility. 

Starting from the first ft-map in the data matrix X i.e. start¬ 
ing from the row vector X\ where 


Xi = { Xij\ i = 1 and j = 1, 2,..., 550} (14) 


we have 550 values that are fed into the input layer of the 
neural network. These values are then non-linearly combined 
in each hidden ‘neuron’ to get 50 output values forwarded to 
the output layer, given by 


550 

x 'lk = W kj X U + W k0 ) ( 15 ) 

3 = 1 


where fc = 1,2,..., 50 is the index corresponding to each ‘neu¬ 
ron’ in the hidden layer and the superscript (1) represents the 
hidden layer. The parameters Wkj are called the weights while 
the parameters Wk o are called the biases of the neural network. 

The ‘output’ layer used ‘neurons’ with the soft-max acti¬ 
vation function which is typically used in classification prob¬ 
lems to achieve a 1-to-n output encoding Da. In particular, 
the soft-max function rescales the outputs in order for all of 
them to lie within the range [0,1] and to sum-up to 1. This 
is done by normalizing the exponential of the input bk to each 
output neuron over the exponential of the inputs of all neurons 
in the output layer: 

soft-max(6 fe ) = (16) 

E fc (exp(6fc)) 


When the values from equation ( p~5] > are presented in the out¬ 
put layer we get the result 


50 

x'ii = soft-max W$x' lk + w $) (17) 

k =1 


(where l = 1, 2) as the output value in the single neuron of 
the output layer. Equation ( fT7j ) represents the ‘forward prop¬ 
agation’ of information in the neural network since the inputs 
are ‘propagated forward’ to produce the outputs of the ANN, 
according to the particular ‘weights’ and ‘biases’. 

Equation ( fl7] > also shows that a neural network is a non¬ 
linear function, T, from a set of input variables {x^ such 
that i £ {1,2,..., 22700} as defined by equation © i- e - 
are row vectors of the matrix X' to a set of output variables 
{a;"} such that l £ {1, 2} i.e. the output layer has 2 neurons 
(one fires for noise and the other fires for injection). To merge 
the weights and biases into a single matrix (and 

similarly do with the weights w lk and biases w\ Q ) we need 
to redefine x \ as given by equation ( [14} to 

Xi = { Xij\ i = 1 and j = 0,1, 2,..., 550 and x\q = 1} 

(18) 

and similarly redefine all row vectors of X' as well as all the 
output row vectors from the hidden layer. Then the non-linear 
function T is controlled by a (50 + 1) x (550 + 1) matrix 
w^ 1 ' and a 2 x (50 + 1) matrix of adjustable param¬ 
eters. Training a neural network corresponds to calculating 
these parameters. 

Numerous algorithms for training ANN exist m and in 
general can be classified as being either sequential or batch 
training methods: 

(i) sequential (or ‘online’) training: A ‘training item’ consists 
of a single row (one ft-map) of the data matrix. In each iter¬ 
ation a single row is passed through the network. The weight 
and bias values are adjusted for every ‘training item’ based on 
the difference between computed outputs and the training data 
target outputs. 

(ii) batch training: A ‘training item’ consists of the matrix 
X' (all 22700 rows of the data matrix). In each iteration all 
rows of X' are successively passed through the network. The 
weight and bias values are adjusted only after all rows of X' 
have passed through the network. 

In general, batch methods perform a more accurate estimate 
of the error (i.e. the difference between the outputs and the 
training data target outputs) and hence (with sufficiently small 
learning rate D2) they lead to a faster convergence. As such, 
we used a batch version of gradient descent as the optimiza¬ 
tion algorithm. This form of algorithm is also known as ‘back- 
propagation’ because the calculation of the first (or hidden) 
layer errors is done by passing the layer 2 (or output) layer 
errors back through the w <2! matrix. The ‘back-propagation’ 
gradient descent for ANNs in batch training is summarized as 
follows: 
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Algorithm 1 Gradient Descent for ANN 

1. Initialize w (and biases) randomly. 

while error on the validation set satisfies certain criteria do 
for i= 1:22700 do 

2. Feed-forward computation of the input vector Xi. 

3. Calculate the error at the output layer. 

4. Calculate the error at hidden layer. 

end for 

5. Calculate the mean error. 

6. Update w of the output layer. 

7. Update w of the hidden layer. 

end while 


Out of the 90% of the data that was (randomly) chosen for 
the training, 10% of that was used as a validation set. The 
latter is used in the ‘early stopping’ technique that is used to 
avoid over-fitting and maintain the ability of the network to 
‘generalize’. Generalization is the ability of a trained ANN to 
identify not only the points that were used for the training but 
also points in between the points of the training set. For each 
iteration the detection efficiency of the ANN is tested on the 
validation set. When the error on the validation set drops by 
less than HUfor two consecutive iterations then we do the 
‘early stopping’ and the training is stopped. 

The learning rate of the gradient-decent algorithm deter¬ 
mines the rate at which the training of the network is moving 
towards the optimal parameters. It should be small enough 
not to skip the optimal solution but large enough so that the 
convergence is not too slow. A crucial challenge for the algo¬ 
rithm is not to converge to local minima. This can be avoided 
by adding a fraction of a weight update to the next one. This 
method is called ‘momentum’ of the training of the network. 
Adding ‘momentum’ to the training implies that for a gradi¬ 
ent of constant direction the size of the optimization steps will 
increase. As such, the momentum should be used with rela¬ 
tively small learning rate in order not to skip the optimal solu¬ 
tion. After experimenting with various parameter populations 
we used a learning rate of 0.02 and a momentum of 0.9. 


IV. SUPPORT VECTOR MACHINE 

The second MLA we trained is a support vector machine 
(SVM). This method gained popularity over the ANNs be¬ 
cause it is based on well formulated and mathematically sound 
theory lfl 6 l . In the following paragraphs we give a brief intro¬ 
duction to the SVM mathematical formulation. 

We start from the data matrix X' as given by equation 
( fl2| i. In the SVM formulation we treat the noise ft-maps, 
rows of X\ given by equation ( | 1 ()[ >, as well as the ft-maps 
with r-mode injections, rows of X' 2 given by equation HD- 
as points in a 550-dimensional space. The idea behind the 
formulation of the SVM optimization problem is to find the 
optimal hypersurface that would separate (and hence classify) 
the noise points from the injection points. For this discussion 
we will need the following definitions: 

Definition 1: The distance of a point Xi to a flat hypersurface 


H = {x|(u;, x) + b = 0 } is given by 


d Xi (w,b) = Zi x ((w,Xi) + b) (19) 

where w is a unit vector perpendicular to the flat hypersurface, 
b is a constant, and z* = +1 for (w, Xi) + b > 0 and Z; = —1 
(w, Xi) + b < 0. The index i (in xi) takes values from the set 
{1, 2, 3,..., 22700}. In the following discussion each point 
Xi that lies above the hypersurface pairs with a value z, = 1 
and each point x, that lies below the hypersurface pairs with 
a value of Zj = — 1 . 

Definition 2: The ‘margin’, 75 (w,b), of any set, S , of 
vectors is defined as the minimum of the set of all distances 
T> = {d Xi (w,b)\xi £ 5} from the hypersurface T-L. For the 
purpose of our discussion the set S is the union of the set of 
all noise points and the set of all injection points. 

For definition 3 we assume that a training set consists of 
points Xi with each one belonging to one of two distinct data 
classes denoted by yi = 1 (for one class) and y t = — 1 (for the 
other class). We may further assume that the set of all noise 
points belongs to the class represented by yi = — 1 while the 
set of all injection points belongs to the class represented by 
Vi = +1- 

Definition 3: A training set {(27, yi), (x„, y n )\xi £ 
R d ,j/i £ { — 1 ,+ 1 }} is called ‘separable’ by a hypersurface 
T~L = (x|(to,x) + 6 = 0 } if both a unit vector w (||w|| = 1 ) 
and a constant 6 exist so that the following inequalities hold: 

(w, Xi) + b> 75 if yi = +1 ( 20 ) 

(w, Xi) + b < —75 if yi = - 1 ( 21 ) 

where S = {x$| i = 1, 2,..., n} and 75 is given by definition 
2 . 


For the purpose of our discussion d = 550 is the dimen¬ 
sionality of the points x, (this dimensionality corresponds 
to the number of pixels in each ft-map) and n = 22700 
is the number of our (ft-maps) data points. Using the fact 
that the hypersurface is defined up to a scaling factor c, i.e. 
H = {x| {cw, x) + cb = 0}, we can take c such that C 75 = 1 
and hence we can rewrite equations ( | 20 ] > and ( |2T| as 

Hi x (( cw,Xi) + cb) > 1 for all i=l,2,...,n. (22) 

Defining w' = cw i.e. ||u; / || = c and dividing equation ([22]) 
by c we get 

w' 1 

Vi x (<Ti — 7TT’+ & ) ^ 71—FT for all i=l,2,...,n. (23) 

If II If II 


Formulation of the SVM optimization problem: Given a 


training set, that is, a data matrix X' = 


X'i 

*2 


, X[ being the 


noise points and X' 2 being the injection points, we want to find 
the ‘optimal separating hypersurface’ (OSH), that separates 
the row-vectors of A'} from the row-vectors of X 2 . Accord¬ 
ing to definition 3, this translates to maximizing the ‘margin’ 
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7 s- In other words, we want to find a unit vector w and a con¬ 
stant b that maximize pjrjy- Therefore, the SVM optimization 
problem can be expressed as follows 

min -1| w' || 2 subject to (24) 

w,b 2 

1 — Vi x {(w',Xi) + b') < 0 for all i=l,2,...,n (25) 

where b' = cb. This is a quadratic (convex) optimization prob¬ 
lem with linear constraints and can be solved by seeking a so¬ 
lution to the Lagrangian problem dual to equations ( [24] ) and 

©• 

Before formulating the Lagrangian dual we introduce the 
‘slack variables’, 6 (i = 1, 2 ,n), that are used to relax the 
conditions in equation ( |22[ and account for outliers or ‘errors’. 
Instead of solving equation ( [24] ) we seek a solution to 

1 ” 

min -||u/|| 2 + 6 subject to 

w.b 2 ^ J 

2 = 1 

6 > 0 and 1 — y-i x ((«/, 27) + b') — 6 < 0 for all i=l,..,n. 

(26) 

The slack variables 6 measure the distance of a point that 
lies on the wrong side of its ‘margin hypersurface’. Using the 
Lagrange multipliers 

oti> 0 and /3j > 0 (27) 

the Lagrangian dual formulation of equation ( [26] ) is to maxi¬ 
mize the following Lagrangian 

1 n n 

C(w6 , & , a, /3) =-||^'|| 2 + C 6 - E P&+ 

i=1 i=1 

n 

+ -t/jX ((w',Xi) + b) - £i). 

i=1 

(28) 


Using the stationary first order conditions for w', b and 6 


«9£ 


^7 = w j-'52 otiyiXij = 0, Vj = 1,2,... d, 


dW 3 

dC 

2=1 

o n 

— = C - on - fa = 0, Vi = 1,2,... n. 

<96 


= 52 ai2/i = 


(29a) 

(29b) 

(29c) 


(where Xij is the j th entry of the x-i data point) the Lagrangian 
dual as given in expression ( [28] ) can be re-expressed only in 
terms of the a.i Lagrange multipliers, as follows 


n 


1 . , 

2 5 , (T/ ' ^’ 7 ') 

»J=1 


(30) 


and hence we can evaluate the a* Lagrange multipliers by 
solving the following optimization problem 


n 


ma xjC(ai) subject to 7 cayi = 0, 

OLi ^ 

2=1 

0 < aj < C , Vi = 1 , 2,.., n.. 


(31a) 

(31b) 


Defining Gij = yiUjX^Xi problem (fTTa]-(|3Tbl) is equiva¬ 
lently expressed as 

min -a T Ga-e T tt (32a) 

oti£ TZ n 2 

subject to y 1 a = 0 (32b) 

and 0 < cti < C , i = 1, 2, ...,n (32c) 

where e T is a n— dimensional row vector equal to e T = 
(1,1,..., 1) and ( |32c| ) is derived from ( |29c[ > together with ( f27| . 

Since the objective function in equation ( [32] ) is quadratic 
and all the constraints are affine, the problem defined by these 
equations is a quadratic optimization problem. Using the fact 
that (by constrution) the sum of all the entries of G can be 
written as a sum of squares and also using that on > 0 we 
can see that G is positive semidefinite, which implies that the 
problem is convex. Convex problems offer the advantage of 
global optimality; that is any local minimum is also the global 
one. Several methods have been proposed for solving such 
problems including primal, dual and parametric algorithms 

0 a. 

After solving the optimization problem defined by ex¬ 
pressions ( |32a| )-( |32c] i, i.e. after evaluating all the 0:, (1 = 
1, 2, ..., 71), we can find the vector w using ( |29a| ). The con¬ 
stant b can be found by using the Karush-Kuhn-Tucker (KKT) 
complementarity conditions G£|, 


cti{—1 + yi x ((«/, Xi) + b') + 6} = 0 (33a) 

Mi = 0 (33b) 

along with equation ( |29c| ). For any a:, satisfying 0 < a t < C, 
equation \29c) implies that B t > <1 and hence ( |33b[ ) implies 
that 6 = 0. Consequently, we can use the x, corresponding 
to the aformentioned a, to solve equation ( |33a| ) for b'. 

Having calculated the vector w' and the constant b' is equiv¬ 
alent to knowing the hypersurface defined by (■ wXi)+b' = 0. 
During the testing phase a new data point, 27, is classified ac¬ 
cording to 


class^i) = sgn((«/, Xi) + b’). (34) 

For class(Xi) = — 1 we classify the x, point as noise and for 
class(a;i) = +1 we classify the x t point as injection. 

We choose to solve the convex quadratic problem as de¬ 
fined in equation ( [32] ) with sequential minimal optimization 
(SMQ) lf20l . SMO modifies only a subset of dual variables 07 
at each iteration, and thus only some columns of G are used 
at any one time. A smaller optimization subproblem is then 
solved, using the chosen subset of o:,. In particular at each 
iteration only two Lagrange multipliers that can be optimized 
are computed. If a set of such multipliers cannot be found then 
the quadratic problem of size two is solved analytically. This 
process is repeated until convergence. The integrated software 
for support vector classification (LIBSVM) l2ll is a state of 
the art SMO-type solver for the quadratic problem found in 
the SVM formulation. SMO outperforms most of the exist¬ 
ing methods for solving quadratic problems 1221 . Hence we 
choose to use it for training the SVM, using the LIBSVM rou¬ 
tine ‘svmtrain’. 
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Non-linear SVM: The soft margins can only help when 
data are ‘reasonably’ linearly separable. However, in most 
real world problems, data is not linearly separable. To deal 
with this issue we transform the data into a ‘feature’ (Hilbert) 
space, 7~L, (a vector space equipped with a norm and an inner 
product), where a linear separation might be possible due to 
the choice of the dimensionality of H, dim("H) > dim(R d ). 
The transformation is represented by 


determine the value of the parameter C, we plotted training 
efficiencies against several values of C. We determined that 
C should be in the range of 10 4 — 10 5 . All experiments with 
SVM are conducted with 90/10 split on data, where 90% of 
the data is randomly selected for training and the remaining 
10% is used for testing. 

Using the ’Kernel trick’, we substitute 27 with <I>(27) in 
equations (|26|-(|34|. Then equation (|30|) is re-expressed as 


$ :R d ->• n 

such that <£>(27) £ H. 


n 1 11 

( 35 ) C(0Li) = ^07 - - ^2 a i a jyiyj{®{ x j)i®( x i))- ( 38 ) 

i—1 i : j—l 


From equations ( |30l ) and ([35} we see that the non-linear 
SVM formulation depends on the data only through the 
dot products *f >(27) ■ <&(xj) in H. These dot products are 
generated by a real-valued ‘comparison function’ (called the 
‘Kernel’ function) k : I d x —» R. that generates all the 

pairwise comparisons Kjj = k(Xi,Xj) = $(27) ■ $(27). We 
represent the set of these pairwise similarities as entries in a 
n x n matrix, K. The use of a kernel function implies that 
neither the feature transformation $ nor the dimensionality of 
TL are required to be explicitly known. 

Definition 4: A function k : C x C —> K. is called a positive 
semi-definite kernel if and only if it is: (i) symmetric, that 
is k(xi,Xj ) = k(xj, Xi) for any 27. Xj £ C and (ii) positive 
semi-definite, that is 

n n 

c T Kc = EE CiCjk(xi,Xj) > 0 (36) 

*=i i=i 

for any 27, x 3 £ C where i,j £ {1,2,..., n} and any c £ R" 
i.e. Cj, Cj £ R. (i = 1,2,..., n ) and the n x n matrix K has 
elements K. rj = k(xi,Xj). 


After solving ([32}, the a, (i = 1,2, ...,n) are substituted in 
( |29a| i that we solve for w'j to get 

n 

Vj = i,2, ...d (39) 

i= 1 

where <I)j (a;,) is the j th entry of the $(0:^) transformed data 
point. Since the transformation <1» is not obtained directly we 
never calculate the w' vector explicitly. Nevertheless,we can 
substitute expression ([39} in ([33a} and solve the latter for // 
(when £k = 0 and ak ^ 0) as follows 

n 

b' = 1 - Uk X y ^aiyi{®{xi), ${x k )) (40) 

i—l 

where this result should be independent of which k we use. 
Having the expression ( [39} for the vector w' and the expres¬ 
sion ( |40} for the constant b' we can classify a new data point 
during the testing phase according to 

class(xj) = sgn((w/, $(xj)) + b'). (41) 


The nature of the data we are using strongly suggests that 
our data points are not linearly separable in the original fea¬ 
ture space. Therefore we choose to solve the dual formula¬ 
tion as given by equation ([32} where G is now defined by 
Gij = yiUjk(xi,Xj ) so that we can use the ‘Kernel Trick’. 
Solving the dual problem has the additional advantage of ob¬ 
taining a sparse solution; most of the a.i will be zero (those 
that satisfy 0 < on < C are the support vectors that define 
the hypersurface). For the purpose of our study we used the 
Radial Basis Function (RBF) kernel defined by 

k(xi, Xj) = exp ^ j (37) 

where 7 is a free parameter and a is the standard deviation of 
the Xi. Typically free parameters are calculated by using the 
cross validation method on the data set, meaning that we split 
the data set into several subsets and the optimization problem 
is solved on each subset by using a kernel with a different 
parameter value 7. We then choose the parameter value that 
gives the lowest minimum value of the objective function. It 
has been seen in many previous applications that the value of 
7 giving optimal results was equal to 7 = 1/d = 1/550. To 


From © we see that we are able to calculate the new 
(flat) hypersurface in the new feature (Hilbert) space simply 
through inner products of ($(xj), $(27)). 


V. CONSTRAINED SUBSPACE CLASSIFIER 

The idea in the constrained subspace classifier (CSC) 
method is similar to the idea used in SVM. In the latter the tar¬ 
get was to separate the noise points (or noise vectors) from the 
injection points (or injection vectors) using a hypersurface. In 
the CSC method the idea is to project the noise vectors, rows 
of X[ (equation ([TO}), onto a d \-dimensional subspace Si, (of 
dimensionality d\ < d) of the d-dimensional space (d = 550) 
and also project the injection vectors, rows of X' 2 (equation 
©), onto a subspace S 2 , (of dimensionality da < d). That 
is we seek to find two (optimal) subspaces such that we can 
classify data (ft-map) points according to their distance from 
each subspace: points closer to the subspace S\ are classified 
as ‘noise points’ and points closer to the subspace S 2 are clas¬ 
sified as injection points. 

The optimality of the choice of each subspace depends on 
the chosen basis vectors, the chosen dimensionalities, d\ and 
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^2, of each subspace as well as the relative orientation be¬ 
tween the two subspaces. Each choice corresponds to a given 
variance of the projected data: the closer the variance of the 
projected points is to the variance of the original data set the 
more optimal the subspaces are considered. Of course there 
is a trade-off between optimality and speed so we picked di¬ 
mensionalities d\ = e?2 = 100 for some cases (most powerful 
injections) and d± = g ?2 = 200 for some others (weakest in¬ 
jections). 


A. The projection operator 

Let S be a data space of dimension equal to the number 
of features, d, of the selected dataset (for our study d is the 
dimensionality of the ft-maps after the resolution reduction 
i.e. d = 550). We can always find an orthonormal basis for S 
(using the Gram-Schmidt process) given by 

Ud = {u\,u- 2 ,... ,Ud} with m £ R d Vi = 1,2, ...,d (42) 

i.e. Ud £ R dxd . We seek to find a subspace of S of dimension 
d\ < d. Since reducing the dimensionality brings the data 
points closer to each other, thus reducing the variance, we try 
to reduce the number of features from d to d.\ while trying 
to maintain the variance of the data distribution as high as 
possible. 

To achieve the dimensionality reduction we seek to find a 
projection operator that projects the data points from W l to a 
(dimensionally reduced) subspace W 1 ' of orthonormal basis 
given by 

U dl ={ui,U2,...,u dl } with Ui£R d Vi = 1,2,..., d\ 

(43) 

i.e. U dl £ R dxdl . By definition the projection operator is 
given by 


P = Q(Q T Q)~ 1 Q T (44) 

and projects a vector onto the space spanned by the columns 
of Q. Therefore, we may take the columns of Q to be the 
orthonormal vectors given in ( |43) >, that is Q = U dl . In that 
case, equation <(44]» becomes 

P = U dl (UjU dl )~ 1 Ul (45) 

which is the projection operator onto the space spanned by the 
column vectors of U dl . 

Since equation ( |42| is an orthonormal basis for B d then 
U J dl U dl = I dl . Therefore, the expression of the projection 
operator that can project the (data) vectors in B d onto its sub¬ 
space R dl is given by 


orthonormality of the rows of U d implies U d Uj = I d (i.e. Uj 
is the right inverse of U d ). Therefore, for the special case that 
di = d we have that UJ is the inverse of U d or 

U] = U d \ (47) 


B. Principal component analysis (PCA) 


To introduce PCA we will use the definition of the data ma¬ 
trix X[ as given by equation (101. Using the projection opera¬ 
tor as given by expression (|46|) we want to project the ft-maps 
of X[ in a subspace R dl of R d (d\ < d). Let Xi be the origi¬ 
nal 1 x d row vector in R d . We project the column vector xj 
onto R dl thus defining T, T = U dl U di xJ. Then the norm of 
the difference between the original and the projected (column) 
vectors can be expressed as 


II A - Xi T II = Ik - U^U^xJ || (48) 

where U dl £ R dxdl . In PCA we want to find the subspace 
R dl such that 


~ u diU dl xJII 2 is minimized ^ 
subject to U] h U dl = l dl . 

This subspace R' / is defined as the d \-dimensional hyper¬ 
surface that is spanned by the (reduced) orthonormal basis 
{ui, U 2 , U3 ,..., u dl }. i.e. finding such a basis is equivalent 
to defining the subspace R dl . 

Using the definition of the Lrobenius norm for a mxn matrix 
A, 


u\\f = M 2 = \/ trace ( A * A ) ( 5 °) 

\ »=1 i=t 

where A* is the conjugate transpose of A, we get 

n 

Y, IK T - U dl Ulx}f F = tr {X[\X[(1 - U dl Ul)} (51) 

i=1 


where X[ £ R nxd (where n = 22700 and d = 550 as shown 
in equation (fTO])). Thus the optimization problem in equation 
( |49| ) reduces to |[23l 


X[(l-U dl Ul)} 
subject to U 1 d V dl =T dl . 


(52) 


P = U dl Uj i . (46) 

In case d\ = d then P = U d U d . Since U d is a square ma¬ 
trix whose columns are orthonormal, this implies that its rows 
are also orthonormal. Orthonormality of the columns of U d 
implies UjU d = I d (i.e. Uj is the left inverse of U d ) and 


Since trjXjUYj} is a constant, the optimization problem 
can be re-written as 


if/^x tr { Ufa X[ T X[U dl } 
subject to UJ M U dl = l dl . 


( 53 ) 
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To solve equation ( |53} we define the Lagrangian dual prob¬ 
lem by 


(using the symmetry of A,; ? ) the last two terms of equation 
© can be combined to a single term to get 


=t r (UlX[ T X[U dl )- 

di di d 

— E E! Ai f ^E UjkUki — fiji) 

i=1 j =1 /c—1 


where 5ij = 


1 for i = j 
0 for i ^ j. 


(54) 


Since U d JJ d 1 is a symmetric d\ x di matrix then the or¬ 
thonormality condition in equation ( |53} represents a total of 
d\ x (d\ + l)/2 conditions. Therefore, for the Lagrangian 
dual problem (as shown in equation ( [54] )) we need to intro¬ 
duce d\ x (di + l)/2 Lagrange multipliers A ij. Hence we 
require that X, 7 is a symmetric matrix. Also since each term 
in ( [54] ) involves symmetric matrices then the following first 
order optimality conditions 


dC 

dX 


= 0 and 


pq 


dC 

duL 


= 0. 


(55) 


can be solved for A,; ? only if the latter is symmetric. Using 
equations ([54} and (|55} we get 


dX v 


di d d 

EEE^* 1 *)^- 

i= 1 j=1 k=1 
di di d 

-E E A u (E uj k Uki — Sji ) 

i=i j=i fe=i 


= 0 


(56) 


and 


dU lr 


d\ d d 

EEE 

L i=i j=i 

di di d 

EE^'E'E" 

i=l j=l fc=l 


= 0. 


(57) 


Equation ( |56[ implies the di x (di + l)/2 equations 

d 

J2 UT qk U kp = Sqp (58) 

fc=l 

while equation ( |57] > implies the d x d\ equations 


d d± 

E UJnMFX'Jjl ~ E A ™^ = °‘ (6°) 

j =1 i= 1 

Equations ([60]) and ( |58[ are sufficient to solve for A,, and 
C/fcj. Right-multiplying equation ([60]) by Ui n and summing 
over 1 < l < d we get 

d d d± d 

E E - E A ™ E = °- 

i=i j =i i=i ;=i 

(61) 

Using equation ( |58} then equation ( [67} becomes 

d d 

E E UmSiXfX'JjtUtn = X mn . (62) 

1=1 j=1 

Equations ( [62] ) and ( |60] > represent a set of di x (di + l)/2 and 
di x d equations respectively. These can be solved to obtain 
the di x (di + l)/2 degrees of freedom of A,, and the di x d 
degrees of freedom of (7,/,. 

The left hand side (LHS) of equation ( [62] ) represents the 
a mn elements of a <1\ x d\ matrix and similarly the right 
hand side (RHS) of ( [62] ) represents the X mn elements of an¬ 
other di x di matrix. Equation ([62} implies an entry-by-entry 
equation ( a mn = X mn ) between the two matrices. Choosing 
m = n and summing equation ( [62] ) over 1 < m < d\ implies 
that the sum along the diagonal of the matrix on the LHS is 
equal to the sum along the diagonal of the matrix on the RHS 
or equivalently 

di d d di 

E E E UmSiX^X'JjiUirn = ^ X mm . (63) 

m—1 1=1 j=1 m—1 

Noting that the LHS of ( [63} is the trace of the LHS of ([62} we 
can re-write ( [63} as 

di 

tliU T di X[ T X[U dl ) = J2 X rnm■ (64) 

m= 1 

To interpret the X mm we use a theorem according to which the 
trace of a matrix is equal to the sum of its eigenvalues. There¬ 
fore, we can identify the X mm for 1 < rn < d \ as the eigen¬ 
values of the symmetric matrix {XiU dl ) T (Xi U dl ). However, 
these di eigenvalues are di out of the total d eigenvalues of 
XJ X-\. This can be shown by using the invariance of trace un¬ 
der similarity transformations (in this case under conjugacy). 
Using equation ( [47} we can re-write equation ([64} for <1\ = d 
as 


d d 

E u lA x i x '^ + 

3 =1 k=1 

d\ d\ 

^ ^ ^ ^ A imUli = 0. 

3 = 1 


i= 1 


(59) 


Using the fact that X[ J X[ is symmetric, the first two terms of 
equation ( |59] t can be combined to a single term and similarly 


d 

tr (U^XfXiUd) = ti(X[ T X[) = J2 A mm . (65) 

m—1 

Therefore, the maximum of the objective function F = 
tr(L/J X'-^X'-JJd-i) in expression ([53} is equal to the summa¬ 
tion of the di largest eigenvalues of X] X\ . Therefore the or¬ 
thonormal basis for the lower dimensional subspace is given 
by the set of the eigenvectors corresponding to the di largest 
eigenvalues of the symmetric matrix XjXi. 
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C. Formulation of CSC 

Consider the binary classification problem with X\ £ 
R nxd and X' 2 £ R nxd be the data matrices corresponding 
to two data classes, C\ (noise points) and C2 (injection points) 
respectively. The number of data samples in C\ is the same as 
the number of data samples in C2 and is equal to n/2. The cor¬ 
responding number of features is given by d for both classes 
C\ and C2 (in our case n/2 = 11350 and d = 550). 

We attempt to find two linear subspaces <Si C C, \ and 
S 2 C C 2 that best approximate the data classes. Without loss 
of generality we assume the dimensionality of these subspaces 
to be the same and equal to d-\. Let 


and 


U = [u 1 ,u 2 ,...,u dl \ £ 


V = [vi,v 2 , ■ ■ -,v dl ] £ 


pdxdi 


pdx d\ 


( 66 ) 


(67) 


represent matrices whose columns are orthonormal bases of 
the subspaces S\ and S 2 respectively. If we attempted to 
find <Si independently from S 2 then we would have to capture 
the maximal variance of the data projected onto 5) separately 
from the maximal variance of the data projected onto S 2 . That 
would be equivalent to solving the following two optimization 
problems | 


tr (U J X[ T X[U) 
subject to U T U = I dl 


max 

ue«. dxd 


and 


max tr(V T X' 2 X' 2 V) 

^ gR dx<ii 

subject to V J V = I dl . 


( 68 ) 


(69) 


The solution to the optimization problem as shown in ex¬ 
pression (68} is given by the eigenvectors (the columns of the 
orthonormal basis U of <Si) corresponding to the d\ largest 
eigenvalues of the matrix X^ 1 X[. Similarly, the solution to 


The last term of the objective function G = tr(U T X[ T X[U)+ 
tr(V J X 2 J X 2 V) -\-C\x(U^VV 1 U) is a measure of the relative 
orientation between the two subspaces as defined in (5). The 
parameter C controls the trade off between the relative orien¬ 
tation of the subspaces and the cumulative variance of the data 
as projected onto the two subspaces. For large positive values 
of C, the relative orientation between the subspaces reduces 
(the two subspaces become more ‘parallel’), while for large 
negative values of C, the relative orientation increases (the 
two subspaces become more ‘perpendicular’ to each other). 

This problem is solved using an alternating optimization 
algorithm described in (51. For a fixed V, expression © 
reduces to 


subject to U J U = I, 


d\ •> 


V 1 V = I dA . 


max tr(U J (X[ 1 X[+CVV J )U) 

UG R dxd i 

subject to U J U = I d[ . 


(71) 


The solution to the optimization problem (TT} is obtained 
by choosing the eigenvectors corresponding to the d\ largest 
eigenvalues of the symmetric matrix X' 1 X \ + CV \ n . Sim¬ 
ilarly, for a fixed U, expression (70} reduces to 


max tr(V T (X 2 X 2 + CUU T )V) 

V£R dxdl 

subject to V T V = I dl 


(72) 


ti largest eigenvalues of the symmetric matrix X 2 J X 2 


the optimization problem as shown in expression ( |69[ i is given 
by the eigenvectors (the columns of the orthonormal basis V 
of S 2 ) corresponding to the d-\ largest eigenvalues of the ma¬ 
trix X' 2 X 2 . Though the subspaces Si and S 2 are good ap¬ 
proximations to the two classes C\ and C2 respectively, these 
projections may not be the ideal ones for classification pur¬ 
poses as each one of them is obtained without the knowledge 
of the other. 

In the constrained subspace classifier (CSC) the two sub¬ 
spaces are found simultaneously by considering their relative 
orientation. This way CSC allows for a trade off between 
maximizing the variance of the projected data onto the two 
subspaces and the relative orientation between the two sub¬ 
spaces. The relative orientation between the two subspaces is 
generally defined in terms of the principal angles. The opti¬ 
mization problem in CSC is formulated as follows 

max tr {W X' 1 X[U) + tr(V T X' 2 J X' 2 V) + Cir(UWV J U) 

u,ve K dxdi 


where the solution to the optimization problem ( |72| is again 
obtained by choosing the eigenvectors corresponding to the 

di 

CU 1 U 1 . 

The algorithm for CSC can be summarized as follows: 

Algorithm 2 CSC (X[,X' 2 , d u C) 

1. Initialize U and V such that U T U = I dl , V T V = I dl . 

2. Find eigenvectors corresponding to the di largest eigenvalues 
of the symmetric matrix X'i'X[ + CVV T . 

3. Find eigenvectors corresponding to the di largest eigenvalues 
of the symmetric matrix X' 2 1 X' 2 + CUU 1 . 

4. Alternate between 2 and 3 until one of the termination rules 
below is satisfied. 


We define the following three termination rules: 

• Maximum limit Z on the number of iterations, 

• Relative change in U and V at iteration m and m + 1, 


toiE? — 


tol™ = ' 


||[/(m+l) _ t/(m)|| F 

’ 

y(m+l) _ y(m)|| F 

Vn 


(73) 


where N = dxd± and the subscript F denotes the Frobe- 
nius norm. 

• Relative change in the value of the objective function G 
as shown in expression (70} at iteration m and m+ 1, 


t°l? = 


Q(m+l) _ Q{m) 




1 


(70) 


(74) 
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The value of Z was set to 2000, while tol™, tol™ and toly are 
all set at the same value of 10 6 . From equation ([50} we see 
that the factor of 1/ y/~N in ([73} results in the averaging of the 
squares of all the entries of the matrices (L r< - m+1 ) — U 
or (l/( m+1 ) — y( m )). This regularization factor keeps the 
tolerance values independent of the data set. 

After solving the optimization problem ( [70} (by utilizing 
algorithm 2) a new point x is classified by computing the dis¬ 
tances from the two subspaces 5) and S> defined by 

dist(at, l Si) = tr(C/ T x T x(7) (75) 

and 

dist^,^) = tr(V T .T T ccV). (76) 

The class of x is defined by 

classic) = arg{ min {dist(x, 5,)}}. (77) 

ie{l,2} 

In our case, if x is closer to Si then x is classified as noise 
(or ‘no signal’) and if x is closer to S2 then x is classified as a 
r-mode injection (or ‘presence of signal’). 

VI. RESULTS AND DISCUSSION 

When using the conventional clustering algorithm in 0 the 
false alarm rate (FAR), or false positives, is easily controlled 
by adjusting the SNR threshold above which a ft-map is con¬ 
sidered to include a r-mode signal. This is not the case for the 
MLAs we used where the FAR is given after the training is 
performed as part of the training output. For this reason, to 
draw fair comparisons, we adjusted the FAR of the conven¬ 
tional algorithm to match the output FAR of the MLAs. In 
each of the tables II-VI, the results of the conventional algo¬ 
rithm are presented in 4-tuples: the first entry corresponds to 
a sensitivity result with a fixed FAR equal to 0.1% and the 
second, third and forth entries are results corresponding to the 
same FAR as that of the ANN, SVM and CSC respectively. 
The results presented on tables III, IV and VI are also plotted 
in Fig[4} Fig[5]and Fig[6]respectively. 

In table I we present the results of the conventional algo¬ 
rithm and the three MLAs on the (a = 0.1, f a = 1500 Hz) 
waveform. The latter were trained with data produced with 
h taking values over the range of 10' 24 < h < 10 21 . The 
MLAs did not outperform the conventional algorithm. The 
number of data that was produced was limited due to the finite 
amount of data that was produced during the S5 LIGO run. 
This amount of data was too small for the MLAs to achieve 
generalization, hence the training efficiencies are too low. To 
avoid this the next steps involved training of the same number 
of data over smaller ranges of values of h. 

In table II we present the detection efficiency results for 
the conventional algorithm and the three MLAs on the (a = 
0.1, f 0 = 1500 Hz) waveform. The MLAs were trained 
with data produced with h taking values over the range of 


Iq-23. 2 < h < 10” 22 - 7 . There was not a lot of expectation 
that the MLAs would outperform the conventional algorithms 
because the MLAs were trained on a data set whose injection 
distances were lower than the distance at which the conven¬ 
tional algorithm had a 50% detection efficiency. 

In table III we present the detection efficiency results for 
the conventional algorithm and the three MLAs on the (a = 
0.1, f 0 = 1500 Hz) waveform. The MLAs were trained 
with data produced with h taking values over the range of 
10- 23 - 7 < h < 10~ 23 ' 2 . The training of the MLAs on this 
training set resulted in false alarm rates of 4%, 5% and 10% 
for the ANN, SVM and CSC respectively. At the 50% false 
dismissal rate (FDR), the ANN shows an increase of ~ 20% 
in the detection distance - from 1.5Mpc (of the conventional 
algorithm dash-dot blue line) to 1.8Mpc. The SVM shows an 
increase of ~ 16% - from 1.55Mpc (of the conventional al¬ 
gorithm dash-dot green line) to 1.8Mpc. The CSC shows an 
increase of ~ 10% - from 1.6Mpc (of the conventional algo¬ 
rithm dash-dot red line) to 1.75Mpc. 

In table IV we present the detection efficiency results 
for the conventional algorithm and the three MLAs on the 
(a = 0.1, f 0 = 1500 Hz) waveform. The latter were trained 
with data produced with h taking values over the range of 
10“ 24 0 < h < 10” 23 - 7 . The training of the MLAs on this 
training set resulted in high false alarm rates of 18%, 22% 
and 36% for the ANN, SVM and CSC respectively. At the 
50% FDR, both the ANN and the SVM algorithms show an 
increase of ~ 75% in the detection distance - from 1.6Mpc 
(of the conventional algorithm dash-dot blue and green lines) 
to 2.8Mpc. The CSC shows no increase - both dash-dot and 
solid red lines stay at 50% up to distances of ~ 2.9Mpc. The 
distance range covered in this set has a practical significance 
because it covers: (a) the distance of 3.5 Mpc at which the 
January 2014 supernova occured in M82 and (b) the distance 
of 5 Mpc at which the supernova event rate in the Milky Way 
neighborhood is about 1 every 1-2 years. 

In table V we present the detection efficiency results for 
the conventional algorithm and the three MLAs on the (a = 
0.01, f 0 = 1100 Hz) waveform. This is a much weaker sig¬ 
nal than the (a = 0.1, f 0 = 1500 Hz) as can be seen from 
0 The MLAs were trained with data produced with h taking 
values over the range of 10” 23 7 < h < 10 -23 2 . The MLAs 
did not outperform the conventional algorithms. This was ex¬ 
pected because the MLAs were trained on a data set whose 
injection distances were lower than the distance at which the 
conventional algorithm had a 50% detection efficiency. 

In table VI we present the detection efficiency results for 
the conventional algorithm and the three MLAs on the (a = 
0.01, f 0 = 1100 Hz) waveform. The MLAs were trained 
with data produced with h taking values over the range of 
10” 24 < h < 10 -23 ' 7 . The training of the MLAs on this 
training set resulted in false alarm rates of 18%, 22% and 36% 
for the ANN, SVM and CSC respectively. At the 50% false 
dismissal rate (FDR), the ANN shows an increase of ~ 20% 
in the detection distance - from 175Kpc (of the conventional 
algorithm dash-dot blue line) to 210Kpc. The SVM shows no 
increase while the CSC shows a small increase. 
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TABLE I. Detection efficiencies for injections with waveform parameters a = 0.1 and f 0 = 1500 Hz. The training was performed with 
11350 noise maps and 11350 injection maps. The latter were produced with 10 -3 < a < 10 -1 , 600 < f 0 < 1600 and h values distributed 
in the range of 10 -24 < h < 10 -21 . 


Distance (xll70 Kpc) 

Signal amplitude ( h ) 

Conventional (%) 

ANN (%) 

SVM (%) 

CSC (%) f 

0.1 

4.33 x 10 -23 

100 

100 

100 

0 

0.2 

2.16 x 10" 23 

100 

100 

100 

0 

0.3 

1.44 x 10 -23 

100 

100 

96 

0 

0.4 

1.08 x 10 -23 

100 

83 

47 

0 

0.5 

8.65 x 10" 24 

100 

42 

13 

0 

0.6 

7.21 x 10 -24 

100 

17 

0 

0 

0.7 

6.18 x 10 -24 

100 

3 

0 

0 

0.8 

5.41 x 10 -24 

98 

1 

0 

0 

0.9 

4.81 x 10" 24 

76 

0 

0 

0 

1.0 

4.33 x 10 -24 

50 

0 

0 

0 

1.1 

3.93 x 10" 24 

29 

0 

0 

0 

1.2 

3.61 x 10" 24 

13 

0 

0 

0 

1.3 

3.33 x 10" 24 

4 

0 

0 

0 

1.4 

3.09 x 10" 24 

4 

0 

0 

0 

1.5 

2.88 x 10 -24 

1 

0 

0 

0 


a 1170 Kpc is the distance at which the conventional algorithm detects 50% of the signals with FAR=0.1%. 
b Calculated using <[5} and substituting the parameter values a = 0.1 and /o = 1500Hz and a distance given by the first column. 
c Results are based on detection efficiencies on full resolution maps. 

d Highest training efficiency (99%) with parameter values: momentum=0.9, learning rate=0.02. True positive 98%. False positive: < 0.01% 
e Highest training efficiency (98%) with parameter values: C= 10 3 . True positive: 98 %. False positive: < 0.01%. 
f Highest training efficiency (90%) with parameter values: di=100, C= 1. True positive: 80 %. False positive: < 0.01%. 


TABLE II. Detection efficiencies for injections with waveform parameters a = 0.1 and f 0 = 1500 Hz. The training was performed with 
11350 noise maps and 11350 injection maps. The latter were produced with 10 -3 < a < 10 -1 , 600 < f 0 < 1600 and h values distributed 
in the range of 6.31 x 10 -24 = io -23 2 < h < 10 -22 ' 7 = 2.00 x 10 -23 . The signal amplitudes that lie within this range are in blue text 
while the distance at which the conventional algorithm detects 50% with FAR = 0.1% is in red text. 


Distance (xll70 Kpc) 

Signal amplitude (h) 

Conventional (%) 

ANN (%) 

SVM (%) 

CSC (%) f 

0.1 

4.33 x 10~ 23 

(100, 100, 100, 100) 100 

100 

100 

0.2 

2.16 X 10~ 23 

(100, 100, 100, 100) 100 

100 

100 

0.3 

1.44 x 10~ 23 

(100, 100, 100, 100) 100 

100 

100 

0.4 

1.08 x 10~ 23 

(100, 100, 100, 100) 100 

100 

100 

0.5 

8.65 x 10~ 24 

(100, 100, 100, 100) 100 

100 

96 

0.6 

7.21 x 10~ 24 

(100, 98, 98, 98) 

98 

97 

89 

0.7 

6.18 x 10~ 24 

(100, 97, 97, 97) 

89 

81 

64 

0.8 

5.41 x 10 -24 

(98, 97, 97, 98) 

76 

53 

45 

0.9 

4.81 x 10~ 24 

(76, 96, 87, 96) 

54 

12 

20 

1.0 

4.33 x 10 -24 

(50, 90, 73, 93) 

44 

9 

19 

1.1 

3.93 x 10~ 24 

(29, 67, 45, 69) 

32 

2 

9 

1.2 

3.61 x 10~ 24 

(13, 39,21,41) 

33 

2 

7 

1.3 

3.33 x 10' 24 

(4, 15, 9, 19) 

22 

0 

10 

1.4 

3.09 x 10~ 24 

(4, 8, 4, 11) 

13 

0 

6 

1.5 

2.88 x 10~ 24 

(1,6, 2, 7) 

8 

0 

10 


a 1170 Kpc is the distance at which the conventional algorithm detects 50% of the signals with FAR=0.1%. 
b Calculated using {5} and substituting the parameter values a = 0.1 and /o = 1500Hz and a distance given by the first column. 

c These 4-tuples areaetection efficiencies with FAR= (0.1%, 0.7%, 0.2%, 1%) that were obtained on full resolution maps. The second, third and forth entries 
are to be compared with the ANN, SVM and CSC results respectively. 
d Highest training efficiency (98%) with parameter values: momentum=0.9, learning rate=0.02. True positive 97%. False positive: 0.7% 
e Highest training efficiency (98%) with parameter values: C= 10 3 . True positive: 96%. False positive: 0.2%. 
f Highest training efficiency (95%) with parameter values: di=100, C= 10 3 . True positive: 92%. False positive: 1%. 
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TABLE III. Detection efficiencies for injections with waveform parameters a = 0.1 and f 0 = 1500 Hz. The training was performed with 
11350 noise maps and 11350 injection maps. The latter were produced with 10 -3 < a < 10 -1 , 600 < f 0 < 1600 and h values distributed 
in the range of 2.00 x 10 -24 = 10 -23 ' 7 < h < io -23,2 = 6.31 x 10 -24 . The signal amplitudes that lie within this range are in blue text 
while the distance at which the conventional algorithm detects 50% with FAR = 0.1% is in red text. 


Distance (xll70 Kpc) 

Signal amplitude ( h ) 

Conventional (%) c 

ANN (%) 

SVM (%) 

CSC (%) 

0.1 

4.33 x 10 -23 

(100, 100, 100, 100 

) 100 

100 

100 

0.2 

2.16 x 10~ 23 

(100, 100, 100, 100) 100 

100 

100 

0.3 

1.44 x 10 -23 

(100, 100, 100, 100) 100 

100 

100 

0.4 

1.08 x 10 -23 

(100, 100, 100, 100) 100 

100 

100 

0.5 

8.65 x 10~ 24 

(100, 100, 100, 100) 100 

100 

100 

0.6 

7.21 x 10 -24 

(100, 99, 100, 100) 

100 

100 

100 

0.7 

6.18 x 10~ 24 

(100, 98. 98. 98) 

100 

99 

99 

0.8 

5.41 x 10~ 24 

(98, 98, 98, 98) 

99 

97 

98 

0.9 

4.81 x 10~ 24 

(76, 98, 98, 98) 

97 

92 

94 

1.0 

4.33 x 10" 24 

(50, 97, 98, 98) 

90 

90 

88 

1.1 

3.93 x 10~ 24 

(29, 85, 87, 94) 

88 

90 

77 

1.2 

3.61 x 10~ 24 

(13, 63, 67, 79) 

92 

93 

70 

1.3 

3.33 x 10~ 24 

(4, 41, 55, 64) 

77 

79 

68 

1.4 

3.09 x 10~ 24 

(4, 21, 25, 38) 

74 

74 

67 

1.5 

2.88 x 10~ 24 

(1,9, 13,24) 

69 

70 

55 

1.6 

2.70 x 10~ 24 

(0, 19, 23, 36) 

45 

44 

37 

1.7 

2.55 x 10~ 24 

(0, 19, 23, 30) 

40 

42 

30 

1.8 

2.40 x 10~ 24 

(0, 16, 16, 19) 

45 

49 

35 

1.9 

2.28 x 10~ 24 

(0, 27, 31, 37) 

35 

29 

30 

2.0 

2.16 x 10~ 24 

(0. 17, 21, 30) 

26 

23 

25 

2.1 

2.06 x 10~ 24 

(0. 13. 15, 22) 

33 

27 

23 

2.2 

1.97 x 10~ 24 

(0, 20, 24, 29) 

30 

23 

27 

2.3 

1.88 x 10 -24 

(0. 29, 33. 41) 

30 

29 

26 

2.4 

1.80 x 10 -24 

(0, 16, 22, 33) 

18 

16 

16 

2.5 

1.73 x 10~ 24 

(0, 20, 26, 33) 

30 

19 

20 

2.6 

1.66 x 10~ 24 

(0,3,4, 11) 

8 

6 

8 

2.7 

1.60 x 10~ 24 

(0, 1,2,9) 

16 

11 

19 

2.8 

1.55 x 10~ 24 

(0, 8,10,16) 

15 

12 

17 

2.9 

1.49 x 10~ 24 

(0, 4, 5, 8) 

26 

18 

20 

3.oF1 

1.44 x 10~ 24 

(0, 1, 3,7) 

15 

7 

22 


a 1170 Kpc is the distance at which the conventional algorithm detects 50% of the signals with FAR=0.1%. 
b Calculated using {5} and substituting the parameter values a = 0.1 and /o = 1500Hz and a distance given by the first column. 

c These 4-tuples areaetection efficiencies with FAR= (0.1%, 4%, 5%, 10%) that were obtained on full resolution maps. The second, third and forth entries are 
to be compared with the ANN, SVM and CSC results respectively. 

d Highest training efficiency (88%) with parameter values: momentum=0.9, learning rate=0.02. True positive: 91%. False positive: 4%. 
e Highest training efficiency (89%) with parameter values: C=10 5 . True positive: 88%. False positive: 5%. 
f Highest training efficiency (83%) with parameter values: di=100, C= 10 4 . True positive: 77%. False positive: 10%. 
g This is the 3.5 Mpc distance at which the M82 supernova exploded in January 2014. 




















Plot of table III: Detection efficiencies for the conventional algorithm (dash-dot), and ANN (solid-blue), SVM (solid-green) and CSC (solid-red) 
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FIG. 4. These are the detection efficiencies for the ( f 0 — 1500 Hz, a = 0.1) waveform. This waveform produces the most powerful signal that the 1998 model predicts. This plot 
demonstrates that (when compared for the same FAR) the MLAs performance is at least as good as that of the conventional algorithm. At the 50% false dismissal rate (FDR), the ANN 
shows an increase of ~ 20% in the detection distance - from 1.5Mpc (of the conventional algorithm dash-dot blue line) to 1.8Mpc. The SVM shows an increase of ~ 16% - from 1.55Mpc 
(of the conventional algorithm dash-dot green line) to 1.8Mpc. The CSC shows an increase of ~ 10% - from 1.6Mpc (of the conventional algorithm dash-dot red line) to 1.75Mpc. 
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TABLE IV. Detection efficiencies for injections with waveform parameters a = 0.1 and f 0 = 1500 Hz. The training was performed with 
11350 noise maps and 11350 injection maps. The latter were produced with 10 -3 < a < 10 -1 , 600 < f 0 < 1600 and h values distributed 
in the range of 10 -24 '° < h < 10 -23 ' 7 = 2.00 x 10 -24 . The signal amplitudes that lie within this range are in blue text while the distance at 
which the conventional algorithm detects 50% with FAR = 0.1% is in red text. 


Distance (xll70 Kpc) 

Signal amplitude ( h ) 

Conventional (%) c 

ANN (%) 

SVM (%) 

CSC (%) 

0.4 

1.08 x 10~ 23 

(100, 100, 100, 100 

) 100 

100 

100 

0.5 

8.65 x 10" 24 

(100, 100, 100, 100) 100 

100 

100 

0.6 

7.21 x 10 -24 

(100, 98, 98, 98) 

100 

100 

100 

0.7 

6.18 x 10" 24 

(100, 97, 97, 97) 

100 

100 

98 

0.8 

5.41 x 10 -24 

(98, 98, 98, 98) 

97 

98 

95 

0.9 

4.81 x 10" 24 

(76, 98, 98, 98) 

98 

100 

93 

1.0 

4.33 x 10“ 24 

(50, 98, 98, 98) 

93 

98 

91 

1.1 

3.93 x 10“ 24 

(29, 96, 96, 98) 

93 

98 

83 

1.2 

3.61 x 10" 24 

(13, 92, 92, 96) 

96 

99 

92 

1.3 

3.33 x 10" 24 

(4, 65, 65, 73) 

91 

93 

82 

1.4 

3.09 x 10" 24 

(4, 50, 50, 64) 

94 

92 

81 

1.5 

2.88 x 10 -24 

(1, 37, 37, 49) 

92 

90 

78 

1.6 

2.70 x 10 -24 

(0, 46, 46, 52) 

68 

72 

61 

1.7 

2.55 x 10 -24 

(0, 44, 44, 54) 

70 

70 

61 

1.8 

2.40 x 10" 24 

(0, 35,35,51) 

77 

72 

63 

1.9 

2.28 x 10 -24 

(0, 49, 49, 57) 

55 

61 

51 

2.0 

2.16 x 10“ 24 

(0. 43, 43, 55) 

55 

59 

56 

2.1 

2.06 x 10" 24 

(0, 41, 41, 52) 

62 

61 

61 

2.2 

1.97 x 10 -24 

(0, 42, 42, 50) 

61 

64 

59 

2.3 

1.88 x 10" 24 

(0, 50,50,57) 

61 

61 

59 

2.4 

1.80 x 10 -24 

(0, 45. 45, 53) 

52 

54 

50 

2.5 

1.73 x 10 -24 

(0, 42, 42, 48) 

50 

49 

52 

2.6 

1.66 x 10 -24 

(0. 28. 28, 35) 

55 

37 

44 

2.7 

1.60 x 10 -24 

(0. 21. 21, 40) 

76 

46 

58 

2.8 

1.55 x 10 -24 

(0, 24, 24, 33) 

67 

42 

50 

2.9 

1.49 x 10 -24 

(0. 22, 22, 37) 

61 

48 

59 

3.o[ 

1.44 x 10“ 24 

(0, 14, 14, 26) 

64 

42 

51 

3.1 

1.40 x 10 -24 

(0, 36, 36, 40) 

47 

43 

41 

3.2 

1.35 x 10 -24 

(0, 34, 34. 44) 

41 

39 

41 

3.3 

1.31 x 10“ 24 

(0, 34, 34, 44) 

51 

42 

43 

3.4 

1.27 x 10 -24 

(0, 54, 54, 57) 

40 

41 

40 

3.5 

1.24 x 10 -24 

(0, 42, 42, 57) 

41 

36 

46 

3.6 

1.20 x 10 -24 

(0,41,41,47) 

43 

46 

49 

3.7 

1.17 x 10 -24 

(0, 44, 44, 48) 

43 

44 

46 

3.8 

1.14 x 10 -24 

(0, 54, 54, 63) 

48 

53 

55 

3.9 

1.11 x 10 -24 

(0, 38, 38, 48) 

43 

43 

43 

4.0 

1.08 x 10" 24 

(0, 42, 42. 48) 

35 

36 

44 

4.1 

1.05 x 10 -24 

(0, 32, 32, 39) 

30 

26 

40 

4.2 

1.03 x 10 -24 

(0,21,21,40) 

37 

30 

44 

4. 3 [ 

1.00 x 10" 24 

(0. 23, 23, 32) 

34 

33 

40 

4.4 

9.83 x 10" 25 

(0, 24. 24, 38) 

34 

41 

52 

4.5 

9.61 x 10 -25 

(0, 19, 19,31) 

41 

33 

48 


a 1170 Kpc is the distance at which the conventional algorithm detects 50% of the signals with FAR=0.1%. 
b Calculated using <[5j and substituting the parameter values a = 0.1 and /o = 1500Hz and a distance given by the first column. 

c These 4-tuples areaetection efficiencies with FAR= (0.1%, 18%, 22%, 36%) that were obtained on full resolution maps. The second, third and forth entries 
are to be compared with the ANN, SVM and CSC results respectively. 
d Highest training efficiency (68%) with parameter values: momentum=0.9, learning rate=0.02. True positive: 72%. False positive: 18%. 
e Highest training efficiency (64%) with parameter values: C=10 5 . True positive: 61%. False positive: 22%. 
f Highest training efficiency (60%) with parameter values: di=200, C=10 5 . True positive: 62%. False positive: 36%. 
g This is the 3.5 Mpc distance at which the M82 supernova exploded in January 2014. 

h This is the 5 Mpc distance for which the supernova event rate (in the Milky Way neighborhood) is 1 every 1-2 years 




















Plot of table IV: Detection efficiencies for the conventional algorithm (dash-dot), and ANN (solid-blue), SVM (solid-green) and CSC (solid-red) 


17 



(%) vfouapijp uoipapQ 


FIG. 5. These are the detection efficiencies for the (f 0 = 1500 Hz, a = 0.1) waveform. This waveform produces the most powerful signal that the 1998 model predicts. At the 50% FDR. 
both the ANN and the SVM algorithms show an increase of ~ 75% in the detection distance - from 1.6Mpc (of the conventional algorithm dash-dot blue and green lines) to 2.8Mpc. The 
CSC shows no increase - both dash-dot and solid red lines stay at 50% up to distances of ~ 2.9Mpc. The distance range covered in this set has a practical significance because it covers: (a) 
the distance of 3.5 Mpc at which the January 2014 supernova occured in M82 and (b) the distance of 5 Mpc at which the supernova event rate in the Milky Way neighborhood is about 1 
every 1-2 years. 
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TABLE V. Detection efficiencies for injections with waveform parameters a = 0.01 and f 0 = 1100 Hz. The training was performed with 
11350 noise maps and 11350 injection maps. The latter were produced with 10 -3 < a < 10 _1 , 600 < f 0 < 1600 and h values distributed 
in the range of 2.00 x 10“ 24 = 10“ 23 7 < h < i(p 23,2 = 6.31 x 10 -24 . The signal amplitudes that lie within this range are in blue text 


while the distance at which the conventional algorithm detects 50% with FAR = 0.1% is in red text. 

Distance (x 133 Kpc) 

Signal amplitude ( h ) 

Conventional (%) c 

ANN (%) 

SVM (%) 

CSC (%) 

0.1 

1.50 x 10 -23 

(100, 100, 100, 100 

) 100 

100 

100 

0.2 

7.51 x 10 -24 

(100, 100, 100, 100) 100 

100 

100 

0.3 

5.00 x 10" 24 

(100, 100, 100, 100) 99 

100 

100 

0.4 

3.75 x 10" 24 

(100, 100, 100, 100) 86 

93 

80 

0.5 

3.00 x 10" 24 

(100, 100, 100, 100) 69 

71 

65 

0.6 

2.50 x 10" 24 

(100, 100, 100, 100) 58 

63 

57 

0.7 

2.14 x 10 -24 

(98, 100, 100, 100) 

43 

47 

43 

0.8 

1.88 x 10 -24 

(92, 100, 100, 100) 

38 

50 

37 

0.9 

1.67 x 10 -24 

(79, 100, 100, 100) 

24 

28 

22 

1.0 

1.50 x 10 -24 

(50, 91, 92, 93) 

20 

27 

21 

1.1 

1.36 x 10 -24 

(23, 69, 70, 75) 

24 

7 

14 

1.2 

1.25 x 10 -24 

(4,41,45,55) 

27 

11 

18 

1.3 

1.15 x 10 -24 

(1,32, 35,44) 

14 

12 

16 

1.4 

1.07 x 10 -24 

(1, 19,21,25) 

15 

18 

20 

1.5 

1.00 x 10 -24 

(0, 10, 13, 18) 

14 

7 

21 


a 133 Kpc is the distance at which the conventional algorithm detects 50% of the signals with FAR=0.1%. 
b Calculated using <[5j and substituting the parameter values a = 0.01 and /o = 1100Hz and a distance given by the first column. 

c These 4-tuples areaetection efficiencies with FAR= (0.1%, 4%, 5%, 10%) that were obtained on full resolution maps. The second, third and forth entries are 
to be compared with the ANN, SVM and CSC results respectively. 

d Highest training efficiency (88%) with parameter values: momentum=0.9, learning rate=0.02. True positive: 91%. False positive 4%. 
e Highest training efficiency (89%) with parameter values: C=10 5 . True positive: 88%. False positive 5%. 
f Highest training efficiency (83%) with parameter values: di=100, C=10 4 . True positive: 77%. False positive 10%. 
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TABLE VI. Detection efficiencies for injections with waveform parameters a = 0.01 and f 0 = 1100 Hz. The training was performed with 
11350 noise maps and 11350 injection maps. The latter were produced with 10 -3 < a < 10 -1 , 600 < f 0 < 1600 and h values distributed 
in the range of 10 -24 '° < h < 10 -23 ' 7 = 2.00 x 10 -24 . The signal amplitudes that lie within this range are in blue text while the distance at 
which the conventional algorithm detects 50% with FAR=0.1% is in red text. 


Distance (x 133 Kpc) 

Signal amplitude ( h ) 

Conventional (%) c 

ANN (%) 

SVM (%) 

CSC (%) 

0.1 

1.50 x 10 -23 

(100, 100, 100, 100 

) 100 

100 

100 

0.2 

7.51 x 10 -24 

(100, 100, 100, 100) 100 

100 

100 

0.3 

5.00 x 10" 24 

(100, 100, 100, 100) 100 

100 

100 

0.4 

3.75 x 10" 24 

(100, 100, 100, 100) 91 

97 

85 

0.5 

3.00 x 10" 24 

(100, 100, 100, 100) 90 

92 

75 

0.6 

2.50 x 10 -24 

(100, 100, 100, 100) 85 

89 

79 

0.7 

2.14 x 10" 24 

(98, 100, 100, 100) 

76 

81 

68 

0.8 

1.88 x 10 -24 

(92, 100, 100, 100) 

75 

77 

66 

0.9 

1.67 x 10" 24 

(79, 100, 100, 100) 

63 

62 

57 

1.0 

1.50 x 10 -24 

(50, 95, 95, 95) 

58 

59 

56 

1.1 

1.36 x 10 -24 

(23, 83, 83, 86) 

58 

44 

47 

1.2 

1.25 x 10" 24 

(4, 74, 74, 80) 

77 

49 

59 

1.3 

1.15 x 10" 24 

(1, 52, 52, 62) 

69 

41 

49 

1.4 

1.07 x 10" 24 

(1, 45, 45, 54) 

61 

46 

58 

1.5 

1.00 x 10" 24 

(0, 30, 30, 43) 

60 

39 

50 

1.6 

9.38 x 10 -25 

(0, 30, 30, 40) 

52 

27 

37 

1.7 

8.83 x 10" 25 

(0,21,21,30) 

40 

25 

39 

1.8 

8.34 x 10 -25 

(0, 33, 33, 43) 

43 

39 

45 

1.9 

7.90 x 10 -25 

(0, 32, 32, 43) 

40 

25 

38 

2.0 

7.51 x 10 -25 

(0, 24, 24, 33) 

42 

33 

43 


a Distance at which the conventional algorithm detects 50% of the signals with FAR=0.1%. 

b Calculated using <[5} and substituting the parameter values a = 0.01 and /o = 1100Hz and a distance given by the first column. 

c These 4-tuples areaetection efficiencies with FAR= (0.1%, 18%, 22%, 36%) that were obtained on full resolution maps. The second, third and forth entries 
are to be compared with the ANN, SVM and CSC results respectively. 
d Highest training efficiency (68%) with parameter values: momentum=0.9, learning rate=0.02. True positive: 72%. False positive: 18%. 
e Highest training efficiency (64%) with parameter values: C=10 5 . True positive: 61%. False positive: 22%. 
f Highest training efficiency (60%) with parameter values: di=200, C=10 5 . True positive: 62%. False positive: 36%. 



















Plot of table VI: Detection efficiencies for the conventional algorithm (dash-dot), and ANN (solid-blue), SVM (solid-green) and CSC (solid-red) 
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FIG. 6. These are the detection efficiencies for the ( f 0 = 1100 Hz, a = 0.01) waveform. This signal was proven (in our previous study) to be detectable only at distances that cover the 
Milky Way. This signal is approximately monochromatic for the durations our sensitivity studies were designed. At the 50% false dismissal rate (FDR), the ANN shows an increase of ~ 
20% in the detection distance - from 175Kpc (of the conventional algorithm dash-dot blue line) to 210Kpc. The SVM shows no increase while the CSC shows a very small increase. 
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Original resolution noise map Resolution reduction by a factor of 100 per axis 



FIG. 7. This is one of the noise ft-maps with the original resolution 
of 1000 x 5000 pixels. The pixels along the vertical axis correspond 
to 1Hz each. The pixels along the horizontal axis correspond to 0.5s 
each, hence the total duration of the map is 2500s. The frequency 
cuts are well known seismic frequency bands and suspension vibra¬ 
tion modes. 


FIG. 8. The highest training efficiency for the ML As was achieved 
with resolution reduction by a factor of 100 per axis, (Fig.3). This 
reduced 10 x 50 resolution ft-map corresponds to the full resolution 
noise map in Fig.4. For the resolution reduction we used bicubic 
interpolation as provided by the matlab imresize.m function. The 
frequency cuts were substituted with zeros before reducing the reso¬ 
lution. 


Original resolution injection with a = 0.1 and f 0 = 1500Hz 
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FIG. 9. This is an injection added to the noise ft-map shown in Fig.4. 
The waveform has parameters a = 0.1 and fo = 1500 Hz. The 
duration of the injection is 2500s and corresponds to a distance to the 
source of 117Kpc. Injections at longer distances are harder to see 
by eye in the original resolution maps. The contrast between signal 
pixels and noise pixels is higher in the reduced resolution maps as 
shown in Fig.5. This makes it is easier to see the injections in the 
reduced resolution maps rather than the full resolution ft-maps. 


FIG. 10. This reduced 10 x 50 resolution ft-map corresponds to the 
full resolution map in Fig.4. Despite the 10000 times reduced reso¬ 
lution as compared to the ft-map of Fig.6, the r-mode injection is still 
visible. It turns out that the reduced resolution ft-maps increase the 
training efficiency for the MLAs, according to Fig.3. However, for 
the parameter estimation algorithms we still use the full resolution 
ft-maps. 






















22 


VII. CONCLUSIONS 

• Pipeline suitability: ANN, SVM and CSC (and very likely 

other machine learning algorithms not tested yet) are 
a suitable class of decision making algorithms in the 
search not only for r-mode gravitational waves but in 
the search for long transient gravitational waves in gen¬ 
eral. The results in this paper demonstrate that the 
stochastic pipeline would benefit from utilizing ma¬ 
chine learning algorithms for determining the presence 
of a signal or not. 

• Computational efficiency: The most computationally ex¬ 

pensive part of this study was the production of the one 
set of 11350 noise ft-maps and the 3 sets of 11350 injec¬ 
tion ft-maps (each set requires up to 10 GB of memory 
and up to 1 week on a 50 node cluster). The 3 sets of 
injections examined the 3 different ranges of values for 
h (those correspond to 3 different ranges of values for 
the distance). In practice, we will know the distance 
to the source so we will have to produce only one set 
of injections that will be determined according to that 
distance. 

• Training/testing speeds: Once we have the method (that is 

presented in this paper) the training of the CSC method 
requires 10 minutes, the training of the SVM method re¬ 
quires about 30 minutes while the training of the ANN 
method requires about 8 hours. After the training is 
done the decision making about the presence of a sig¬ 
nal or not takes about 2 seconds for 100 ft-maps. The 
MLAs are much faster when it comes to the decision 
making process than the conventional algorithm is (that 
takes up to 5 minutes for one ft-map). 

• Detection performance: Comparing table II to table III 

and table V to table VI, we see that when the training is 
performed with injections at distances (marked in blue) 
shorter than the distance, d rec i (marked in red) at which 
the conventional algorithm has a 50% success rate, the 
MLAs are not efficient enough and do not outperform 
the conventional algorithm. This is because the MLA 
training sets (for tables II and V) did not include in¬ 
jections corresponding to distances as long as d rec i or 
beyond. This is what we included in the training sets 
for the MLAs whose results are shown in tables III and 
VI. The latter show that when the MLAs are trained 
with signals injected at distances a little shorter than 
d r ed up to distances 1.5-2 times longer than the latter, 
then the MLAs performance is at least as good as that 
of the conventional algorithm. Training the MLAs with 
injections at distances shorter than d re( i was to ensure 
that the MLAs can detect signals injected at distances 
0.7 — 0.8 that of d re d, and training the MLAs with in¬ 
jections at distances longer than d rec i was done in order 
to push the limits of the MLAs and see how much (if at 
all) they can outperform the conventional algorithm. 

• Low detection efficiency: for the (0.01,1100 Hz) wave¬ 

form. Our suspicion for the low detection efficiencies 


for the second waveform (weakest signal) we tested is 
on the resolution reduction factor of 10~ 2 we used. The 
plot in Fig|3]was obtained on a study with the strongest 
signals (0.1,1500 Hz). We have not tested whether the 
weaker signals have maximum training efficiencies at a 
different resolution reduction than the one we used in 
this study. This needs further investigation. 

• False alarm rates: In our study FARs of 4-10% (for tables 

III and V) and 18-36% (for tables IV and VI) are consid¬ 
ered very high, however, a more carefully chosen train¬ 
ing set may result in lower FARs. The first suggestion 
would be to train the MLAs with a higher number of 
noise and injection ft-maps. If that is not possible (due 
to data availability) we may train the MLAs with in¬ 
jections at distances over a range of ( h 2 ) values that is 
smaller than those in the current training sets. Similarly 
we can use smaller ranges of parameter values for a and 
f a . We can also try to increase the ratio of noise maps 
over injection maps in the training set so that the MLAs 
may recognize the noise maps more efficiently. Specif¬ 
ically for the ANN, one way we may try to reduce the 
FAR is by exploring different topologies in the neural 
network architecture. For SVM and CSC we may in¬ 
troduce a cost function to suppress FAR to acceptable 
values. 

• Search optimization: There are many ways that we can 

further optimize the MLAs specifically designed for the 
search of r-modes gravitational radiation. One way is 
by customizing the ft-map resolution reduction. In¬ 
stead of using bicubic interpolation we may use a reso¬ 
lution reduction algorithm specifically designed for the 
r-mode signals so that the averaging is done along the 
r-mode signal curves. Since the r-mode search is a 
targeted search (using a supernova electromagnetic or 
neutrino trigger) the distance to the source can be es¬ 
timated with an accuracy of 10 — 15% |25. 26|. This 
distance range can then be used to produce injection ft- 
maps with which the MLAs will be trained. In this way 
the training can be optimized for the distance of the de¬ 
tectors to the external trigger. 

• Search constraints: Our current method is specifically de¬ 

signed for r-mode gravitational wave searches. A differ¬ 
ent signal (e.g. gravitational waves sourcing from other 
neutron star oscillation modes) would require their own 
training set produced over the specific model parameter 
values. This is a quite different approach than that of the 
conventional algorithm that is generically designed for 
the detection of any type of signal. Our current method 
involves the production of at least 10000 ft-maps (that 
may be overlapping), any amount of data that will not 
be enough for the production of this many ft-maps will 
limit the sensitivity of the search. At the same time the 
higher the number of the ft-maps used for training is 
the more we may increase the training efficiencies of 
the MLAs. 
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• Future developments: Future developments include opti¬ 
mization of the current methods as well as the use of 
other supervised machine learning algorithms such as 
random forests j27l . Random forests can deal with the 
high dimensionality of our data by revealing features 
that contribute very low information to our analysis; 


which can be discarded prior to classification. With re¬ 
spect to the ANN, we plan to train a deep convolutional 
neural network ll28l which appears to be very promising 
for image classification. 
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